
core: timeouts for batch jobs #27803

Open

pkazmierczak wants to merge 37 commits into main from f-timeouts-for-batch-jobs

Conversation

pkazmierczak (Contributor) commented Apr 7, 2026

This changeset introduces a max_run_duration task group configuration variable
for batch and sysbatch jobs. It's enforced in the alloc runner by a max run
hook. When the timer is up:

  • tasks are killed
  • the allocation is marked as complete, with:
  Client Status       = complete
  Client Description  = allocation exceeded max_run_duration
  • the job status is dead.

The max_run_duration timer starts in the allocation runner regardless of task
states. That is, tasks that take longer to start than the timeout get
terminated. This prevents slow-starting tasks from running longer than their deadline allows.
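
For illustration, a minimal sketch of what such an alloc-runner hook might look like. The type, field, and callback names below are assumptions made for this sketch, not the PR's actual code:

package allocrunner

import (
    "context"
    "time"
)

// maxRunDurationHook (hypothetical name) kills the allocation's tasks
// once the configured duration elapses, counting from alloc-runner
// start rather than from the moment tasks become healthy.
type maxRunDurationHook struct {
    duration time.Duration       // the group's max_run_duration
    killAll  func(reason string) // assumed callback that kills all tasks
}

func (h *maxRunDurationHook) run(ctx context.Context) {
    if h.duration <= 0 {
        return // no timeout configured for this task group
    }
    timer := time.NewTimer(h.duration)
    defer timer.Stop()
    select {
    case <-timer.C:
        // Deadline hit: kill the tasks; the alloc is then marked complete.
        h.killAll("allocation exceeded max_run_duration")
    case <-ctx.Done():
        // The allocation finished (or was stopped) before the deadline.
    }
}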

Lifecycle tasks count toward the max_run_duration timer. The runtime of
prestart and poststart tasks counts toward the overall run time of the task
group, and poststop tasks will not be started if the task was terminated for
running out of its allocated time.

Two new metrics are emitted:

  • client.allocs.max_run_duration.configured_seconds
  • client.allocs.max_run_duration.remaining_seconds
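
For reference, a hedged sketch of how gauges like these might be emitted, assuming the hashicorp/go-metrics API; the function name and its parameters are illustrative, not the PR's actual code:

package client

import (
    "time"

    metrics "github.com/hashicorp/go-metrics"
)

// emitMaxRunDurationMetrics (hypothetical) reports the configured
// timeout and the time remaining until the allocation's deadline.
func emitMaxRunDurationMetrics(configured time.Duration, deadline time.Time, labels []metrics.Label) {
    metrics.SetGaugeWithLabels(
        []string{"client", "allocs", "max_run_duration", "configured_seconds"},
        float32(configured.Seconds()), labels)
    metrics.SetGaugeWithLabels(
        []string{"client", "allocs", "max_run_duration", "remaining_seconds"},
        float32(time.Until(deadline).Seconds()), labels)
}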

Resolves #1782
Supersedes #18456
Internal ref: https://hashicorp.atlassian.net/browse/NMD-551

pkazmierczak added the theme/client, theme/batch (Issues related to batch jobs and scheduling), theme/task lifecycle, and backport/2.0.x (backport to 2.0.x release line) labels on Apr 9, 2026
pkazmierczak self-assigned this on Apr 9, 2026
Comment thread on client/allocrunner/tasklifecycle/max_run_duration.go (Outdated)
pkazmierczak (Contributor, Author)

hey @schmichael @mismithhisler, thanks a lot for the comments! Many of the things you pointed out were messy code from the many iterations of this branch. Apologies, I should've cleaned it up better. But among other things, I simplified max_run_duration.go and removed all the state store stuff that didn't belong here.

I think the main issue that remains is how to approach task-level state. In its current shape, this code always waits for tasks to start. This is good, but it causes issues with task lifecycle events, as @schmichael pointed out. It also means max_run_duration is ineffective in situations where tasks take too long to start, which I believe is one of the major reasons people wanted this feature. The use case you mentioned, @schmichael, about big artifacts that are slow to download etc., is a common pain point I think.

After doodling a bit with a more sophisticated solution, I am slowly leaning towards a more brutal one: make the timer start immediately in the alloc runner, regardless of task state. What do you think about this?

ar.taskCoordinator.TaskStateUpdated(states)

// Get the client allocation
calloc := ar.clientAlloc(states)
mismithhisler (Member) commented Apr 14, 2026

To your point about this being tough because we have to wait for tasks to start: doesn't this function do all the ugly "check to make sure all tasks are running" logic?

I'm curious if we can just pass calloc.ID and calloc.ClientStatus to your MaxRunDuration object and have it just do the "yeah, this alloc is now running, start a timer" logic.
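
For illustration, a hypothetical sketch of that suggested shape; apart from MaxRunDuration itself, every name below is invented for this sketch:

package tasklifecycle

import (
    "sync"
    "time"
)

// MaxRunDuration sketch: starts the timeout once the allocation is
// reported as running. The fields and method are assumptions, not the
// code in this PR.
type MaxRunDuration struct {
    duration time.Duration
    once     sync.Once
    expired  chan string // a consumer kills the alloc whose ID arrives here
}

// AllocUpdated takes the client alloc's ID and status, and arms the
// timer the first time the alloc is reported as running.
func (m *MaxRunDuration) AllocUpdated(allocID, clientStatus string) {
    if clientStatus != "running" { // i.e. structs.AllocClientStatusRunning
        return
    }
    m.once.Do(func() {
        time.AfterFunc(m.duration, func() { m.expired <- allocID })
    })
}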

pkazmierczak (Contributor, Author)

#27827 (which branches off of this branch) implements the simplified version of max_run_duration that disregards task states. Consider the following jobspec:

job "maxrun" {
  type = "batch"

  group "maxrun" {

    max_run_duration = "10s"

    reschedule {
      attempts  = 15
      max_delay = "10s"
      unlimited = false
    }

    task "maxrun" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["1000m"]
      }
    }

    task "maxrun2" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["2s"]
      }
    }
  }

  group "huge_docker_image" {

    max_run_duration = "2s"
    task "prometheus" {
      driver = "docker"

      config {
        image = "prom/prometheus"
      }
    }
  }
}

What happens when we run it on f-timeouts-for-batch-jobs-no-task-states is:

  • the maxrun group runs for exactly 10s: its maxrun task gets killed after 10s, and its maxrun2 task finishes successfully.
  • the huge_docker_image group doesn't even get to start its prometheus task, because downloading the image takes more than 2s.

I am becoming more and more convinced this is a good direction.

schmichael (Member)

> After doodling a bit with a more sophisticated solution, I am slowly leaning towards a more brutal one: make the timer start immediately in the alloc runner, regardless of task state. What do you think about this?

^ + your followup comment seem exciting to me. I'm EOD here but will review the other PRs tomorrow (and check out your internal demo recording).

I can't think of a reason one approach would be more surprising than the other (to try to let least astonishment make our decision for us).

pkazmierczak (Contributor, Author) commented Apr 16, 2026

> I can't think of a reason one approach would be more surprising than the other

Working on this problem I've been trying to explore "modular" solutions. In my mind, we could easily ship a "1.0" version of this that keeps things very simple: starting the timer in the allocrunner, having no regard for post-stop tasks (ok, maaaybe this could be a switch), and just giving users the option to set timeouts for their task groups. We can see how the community responds, and later offer a more fine-grained set of knobs with a task-level max_run_duration, the exact behavior of which we can decide on later, while keeping the tg-level setting more "coarse-grained", if that makes sense.

I think it's the interaction between the tg-level and task-level setting that's the hardest part to get right in this feature.

schmichael (Member)

The CLI and UI should display the deadline in alloc status and maybe job status. We can always create followup issues for that though if we don't want to clutter this PR.

When updating unified-web-docs we should also make sure to link from periodic to this so that users know they have the option of ensuring a previous run is killed before a new run would be scheduled.

pkazmierczak (Contributor, Author)

> We can always create followup issues for that though if we don't want to clutter this PR.

yeah, I'd rather these be separate PRs if you don't mind? I need more commits on main, too. And of course I'll follow up with docs.



Development

Successfully merging this pull request may close these issues.

[feature] Timeout for batch jobs
